The project will focus on TCGA RNA seq data analysis and machine learning techniques on Liver Hepatocellular Carcinoma patient data to find genes that are ultra-sensitive to cancerous genes, i.e. genes that drastically vary in concentration with respect to small changes in the concentration of the cancerous genes. Priority will be given to the genes that play a major role in endothelial membrane receptors since they will most likely affect nanoparticles’ pharmacodynamic and pharmocokinetic properties in the blood vessel.
With the accelerating development in the application of Nanomaterial in medicine, next generation pharmacodynamics and pharmacokinetic modeling and simulation can accelerate the design, validation and translation of targeted nanoparticles by achieving selective targeting of affinity-ligand coated nanoparticles. Within the last 10 years, the number of FDA approved nanomaterials have increased from 2 to almost 50 different types. The types of nanoparticles can range from inorganic to organic, which results in numerous ways to perform drug discovery, delivery and development. Case studies of companies and technology in this market space reveals a three-part structure for commercialization of nanoscale drug delivery platforms: A technology to focus on (DNA cages, liposomes, etc.), a specific pipeline where they combine their developed drug and the nanomaterial to produce a final product and finally, a disease to target. One of the main challenges revolves around the design and validation of the nanoparticle and the drug for optimal performance and therapeutic efficacy. This can be facilitated by developing a qualitative and exhaustive computational platform for modeling the targeting of functionalized nanocarriers to optimize experimental design protocols for drug delivery.
The goal of this research is to extend the previously developed physiologically-based pharmacokinetic modeling of nanoparticle based drug delivery system to better model the experimental observations (Part 1) and to integrate TCGA genetic analysis and the pharmacodynamic simulation to reflect targeted membrane protein expression in human cancers (Part II). This R project will focus on Part II of the research while Part I will be coded in python.
library("DESeq2")
library(rbokeh)
library(limma)
library("pheatmap")
library("RColorBrewer")
library(ggfortify)
library(GenomicDataCommons)
library(ggplot2)
library(knitr)
library(kableExtra)
library(jsonlite)
library(listviewer)
library(survival)
library(rnaseqGene)
library(devtools)
In TCGA database, there are 4 metadata projects(), cases(), files() and annotations().
The JSON table gives very good insight into how the complicated database is structured from the nodes of the metadata.
paste("Total number of projects available in TCGA = ", GenomicDataCommons::count(projects()))
## [1] "Total number of projects available in TCGA = 40"
paste ('Number of default fields = ', length(default_fields('projects')))
## [1] "Number of default fields = 10"
paste ('Number of available fields = ',length(available_fields('projects')))
## [1] "Number of available fields = 22"
# Table of dafault field in projects:
x <- data.frame(default_fields('projects'))
x
## default_fields..projects..
## 1 dbgap_accession_number
## 2 disease_type
## 3 intended_release_date
## 4 name
## 5 primary_site
## 6 project_autocomplete
## 7 project_id
## 8 releasable
## 9 released
## 10 state
kable(x, "html") %>%
kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "left")
| default_fields..projects.. |
|---|
| dbgap_accession_number |
| disease_type |
| intended_release_date |
| name |
| primary_site |
| project_autocomplete |
| project_id |
| releasable |
| released |
| state |
# Table of all available field in projects:
kable(data.frame(available_fields('projects')), "html") %>%
kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "left")%>%
scroll_box(width = "500px", height = "300px")
| available_fields..projects.. |
|---|
| dbgap_accession_number |
| disease_type |
| intended_release_date |
| name |
| primary_site |
| program.dbgap_accession_number |
| program.name |
| program.program_id |
| project_autocomplete |
| project_id |
| releasable |
| released |
| state |
| summary.case_count |
| summary.data_categories.case_count |
| summary.data_categories.data_category |
| summary.data_categories.file_count |
| summary.experimental_strategies.case_count |
| summary.experimental_strategies.experimental_strategy |
| summary.experimental_strategies.file_count |
| summary.file_count |
| summary.file_size |
# JSON Table of default project fields:
all_field_project <- projects()%>%results_all()
jsonedit(all_field_project,mode = "view")
# JSON Table of Disease type of all the projects and sub projects:
jsonedit(all_field_project$disease_type,mode = "view")
# Projects and No. of cases:
res = cases() %>% facet("project.project_id") %>% aggregations() %>% data.frame()
colnames(res) <- c("Project", "NoOfCases")
kable(res, "html") %>%
kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "center")%>%
scroll_box(width = "200px", height = "500px")
| Project | NoOfCases |
|---|---|
| FM-AD | 18004 |
| TARGET-NBL | 1127 |
| TCGA-BRCA | 1098 |
| TARGET-AML | 988 |
| TARGET-WT | 652 |
| TCGA-GBM | 617 |
| TCGA-OV | 608 |
| TCGA-LUAD | 585 |
| TCGA-UCEC | 560 |
| TCGA-KIRC | 537 |
| TCGA-HNSC | 528 |
| TCGA-LGG | 516 |
| TCGA-THCA | 507 |
| TCGA-LUSC | 504 |
| TCGA-PRAD | 500 |
| TCGA-SKCM | 470 |
| TCGA-COAD | 461 |
| TCGA-STAD | 443 |
| TCGA-BLCA | 412 |
| TARGET-OS | 381 |
| TCGA-LIHC | 377 |
| TCGA-CESC | 307 |
| TCGA-KIRP | 291 |
| TCGA-SARC | 261 |
| TCGA-LAML | 200 |
| TCGA-ESCA | 185 |
| TCGA-PAAD | 185 |
| TCGA-PCPG | 179 |
| TCGA-READ | 172 |
| TCGA-TGCT | 150 |
| TCGA-THYM | 124 |
| TCGA-KICH | 113 |
| TCGA-ACC | 92 |
| TCGA-MESO | 87 |
| TCGA-UVM | 80 |
| TARGET-RT | 75 |
| TCGA-DLBC | 58 |
| TCGA-UCS | 57 |
| TCGA-CHOL | 51 |
| TARGET-CCSK | 13 |
# Bar plot of Projects and No. of cases:
figure(width = 900, ylab = "No. Of Cases", xlab = "Projects", title = "Project~Cases(All)") %>% ly_bar(Project, NoOfCases, data = res,hover = res$NoOfCases ) %>% theme_axis("x", major_label_orientation = 90)
# Bar plot of Projects and No. of cases without the FM-AD Project:
res <- res[-1,]
figure(width = 900, ylab = "No. Of Cases", xlab = "Projects", title = "Project~Cases(w/o FM-AD)") %>% ly_bar(res$Project, res$NoOfCases, hover = res$NoOfCases) %>% theme_axis("x", major_label_orientation = 90)
# No. of cases in the HCC Liver cancer project:
cases() %>% GenomicDataCommons::filter( ~ project.project_id == 'TCGA-LIHC' ) %>% GenomicDataCommons::count()
## [1] 377
#Different file type:
q = files() %>%
GenomicDataCommons::select(available_fields('files')) %>%
GenomicDataCommons::filter(~ cases.project.project_id=='TCGA-LIHC' &
data_type=='Gene Expression Quantification')
q %>% facet('analysis.workflow_type') %>% aggregations() %>% kable()%>%
kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "center")
|
# Total number of HiSeq files in Liver Hepatocellular Carcinoma
files() %>% GenomicDataCommons::filter( ~ cases.project.project_id == 'TCGA-LIHC' & analysis.workflow_type == 'HTSeq - FPKM-UQ') %>% GenomicDataCommons::count()
## [1] 424
# Data type of files available in TCGA dataset:
files() %>% facet('data_type') %>% aggregations()%>%kable()%>%
kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "center")%>%
scroll_box(width = "400px", height = "500px")
|
# Analysis Workflow_type available in TCGA dataset:
files() %>% facet('analysis.workflow_type') %>% aggregations()%>%kable()%>%
kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "center")%>%
scroll_box(width = "400px", height = "500px")
|
The LIHC (Liver Hepatocellular Carcinoma) project’s RNAseq and clinical folders are downloaded and named RNA and Clinical respectively. A sample of RNAseq dataset with first 5 rows and 5 columns.
mRNAseq <- read.table('RNA/LIHC.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.data.txt',nrows=20533, header=T,row.names=1,sep='\t')
mRNAseq[1:5,1:5]
## TCGA.2V.A95S.01A.11R.A37K.07 TCGA.2Y.A9GS.01A.12R.A38B.07
## gene_id normalized_count normalized_count
## ?|100130426 0.0000 0.0000
## ?|100133144 1.5051 26.4120
## ?|100134869 3.7074 2.6663
## ?|10357 90.1124 71.0054
## TCGA.2Y.A9GT.01A.11R.A38B.07 TCGA.2Y.A9GU.01A.11R.A38B.07
## gene_id normalized_count normalized_count
## ?|100130426 0.0000 0.0000
## ?|100133144 0.0000 5.7222
## ?|100134869 4.4833 5.1216
## ?|10357 95.5122 61.6679
## TCGA.2Y.A9GV.01A.11R.A38B.07
## gene_id normalized_count
## ?|100130426 0.0000
## ?|100133144 11.4975
## ?|100134869 5.4230
## ?|10357 104.4670
mRNAseq <-mRNAseq[-1,]
mRNAseq[1:5,1:5]
## TCGA.2V.A95S.01A.11R.A37K.07 TCGA.2Y.A9GS.01A.12R.A38B.07
## ?|100130426 0.0000 0.0000
## ?|100133144 1.5051 26.4120
## ?|100134869 3.7074 2.6663
## ?|10357 90.1124 71.0054
## ?|10431 1017.1038 639.2311
## TCGA.2Y.A9GT.01A.11R.A38B.07 TCGA.2Y.A9GU.01A.11R.A38B.07
## ?|100130426 0.0000 0.0000
## ?|100133144 0.0000 5.7222
## ?|100134869 4.4833 5.1216
## ?|10357 95.5122 61.6679
## ?|10431 742.4344 1186.9807
## TCGA.2Y.A9GV.01A.11R.A38B.07
## ?|100130426 0.0000
## ?|100133144 11.4975
## ?|100134869 5.4230
## ?|10357 104.4670
## ?|10431 878.1726
Totalgenes = paste0("Total genes = ", dim(mRNAseq)[1])
# Turn the mRNAseq data into matrix format
x <- as.matrix(mRNAseq)
# Turn each row in numeric format
x <- t(apply(x,1, as.numeric))
# Count how many 0s are in each row in the gene matrix:
r <- as.numeric(apply(x,1,function(i) sum(i == 0)))
# list is created with rows which have more than half of row data with no counts
remove <- which(r > dim(x)[2]*0.5)
# mRNAseq datamatrix is reduced by subtracting the remove list
mRNAseq <- mRNAseq[-remove,]
# Visualizing low count gene removal
text1 = paste0("Removed genes = ", length(remove))
text2 = paste0("Preserved genes = ", dim(mRNAseq)[1])
Totalcases = paste0("Total cases = ", dim(mRNAseq)[2])
plot1 = figure(width = c(700), xlab = "Genes with no. of 0 counts") %>%ly_hist(r,breaks = 50) %>%
ly_abline(v = dim(x)[2]*0.5) %>%
ly_text(x=230, y=14000, text=text1, color = c("red"))%>%
ly_text(x=30, y=14000, text=text2, color = c("red")) %>%
ly_rect(250, 6000, 400, 8000, fill_alpha=0, line_color = c("black"))%>%
ly_text(x=255, y=7000, text = Totalgenes)%>%
ly_text(x=255, y=6000, text = Totalcases)
plot1
Here we can see that the genes that had counts in more than 50% of the patient population get preserved (16910 genes) and the other 423 genes are removed from the dataset since they have 0 counts in more than 50% of the patient population.
Here two different data transformation technique are performed to examin and select which to proceed with for further analysis. The two data transformation techniques are:
rna <- mRNAseq
rna <- as.matrix(rna)
rna <- apply(rna,1,as.numeric)
rna <-t(rna)
# Density plot of Mean and SD of column of 0 filtered matrix (Between samples):
figure(ylab = "Stan. Dev. of Columns(mRNAseq)", xlab = "Means of Columns(mRNAseq)", title = "Standard Deviation vs Mean of Columns of mRNAseq") %>% ly_hexbin(colMeans(rna), colSds(rna))
# Density plot of Mean and SD of row of 0 filtered matrix (Between Genes):
figure(ylab = "Stan. Dev. of Rows(mRNAseq)", xlab = "Means of Rows(mRNAseq)", title = "Standard Deviation vs Mean of Rows of mRNAseq") %>% ly_hexbin(rowMeans(rna), rowSds(rna))
# Plotting gene1 vs gene2 across samples:
plot(rna[,1],rna[,2],pch=16, cex=0.3)
# QQ Plotting gene1 vs gene2 across samples:
qqplot(rna[,1],rna[,2],pch=16, cex=0.3)
Looking between samples, the range of SD and Mean vary over 6 and 5 magnitude of order respectively and the same goes for when looking between genes as it varies over the order of 6. Also the gene1 vs gene2 plot and the QQ plot of itself gives no useful insight as the datatpoints are very skewed towards low count genes.
# puseudocount and log base 2 transformation of gene counts:
log.mRNA.one <- log2(rna + 1)
# Density plot of Mean and SD of column of transformed matrix
figure(ylab = "Stan. Dev. of Columns(Log.one.(mRNAseq))", xlab = "Means of Columns(Log.one.(mRNAseq))", title = "Standard Deviation vs Mean of Columns of transformed mRNAseq (Between samples)") %>% ly_hexbin(colMeans(log.mRNA.one), colSds(log.mRNA.one))
# Density plot of Mean and SD of rows of transformed matrix
figure(ylab = "Stan. Dev. of Rows(Log.one.(mRNAseq)", xlab = "Means of Rows(Log.one.(mRNAseq))", title = "Standard Deviation vs Mean of Rows of mRNAseq (Between Genes)") %>% ly_hexbin(rowMeans(log.mRNA.one), rowSds(log.mRNA.one))
plot(log.mRNA.one[,1],log.mRNA.one[,2],pch=16, cex=0.3)
qqplot(log.mRNA.one[,1],log.mRNA.one[,2],pch=16, cex=0.3)
When there is an uneven number of samples, such as uneven number of normal to cancer patients, voom function is perferred.
# Number of Normal cell samples:
n_index <- which(substr(colnames(mRNAseq),14,14) == '1')
length(n_index)
## [1] 50
# Number of tumor cell samples:
t_index <- which(substr(colnames(mRNAseq),14,14) == '0')
length(t_index)
## [1] 373
# Applying voom function from limma package to normalize the data
vm <- function(x){
cond <- factor(ifelse(seq(1,dim(x)[2],1) %in% t_index, 1, 0))
d <- model.matrix(~1+cond)
x <- t(apply(x,1,as.numeric))
ex <- voom(x,d,plot=F)
return(ex$E)
}
rna_vm <- vm(mRNAseq)
colnames(rna_vm) <- colnames(mRNAseq)
dim(rna_vm)
## [1] 16910 423
# Density plot of Mean and SD of column of 0 filtered matrix:
figure(ylab = "Stan. Dev. of Columns(mRNAseq)", xlab = "Means of Columns(mRNAseq)", title = "Standard Deviation vs Mean of Columns of mRNAseq (Between samples)") %>% ly_hexbin(colMeans(rna_vm), colSds(rna_vm))
# Density plot of Mean and SD of row of 0 filtered matrix:
figure(ylab = "Stan. Dev. of Rows(mRNAseq)", xlab = "Means of Rows(mRNAseq)", title = "Standard Deviation vs Mean of Rows of mRNAseq (Between Genes)") %>% ly_hexbin(rowMeans(rna_vm), rowSds(rna_vm))
figure(ylab = "Stan. Dev. of Columns(mRNAseq)", xlab = "Means of Columns(mRNAseq)", title = "Standard Deviation vs Mean of Columns of mRNAseq (Between samples)") %>% ly_hexbin(colMeans(rna_vm), colSds(rna_vm)) %>% ly_hexbin(rowMeans(rna_vm), rowSds(rna_vm))
plot(rna_vm[,1],rna_vm[,2],pch=16, cex=0.3)
qqplot(rna_vm[,1],rna_vm[,2],pch=16, cex=0.3)
In log2 +1 transformation and voom function, the range of SD and Mean between samples and genes are extremely small compared to the previous raw dataset analysis. This transformation lets us compare the genes and samples. Since the variation is distributed more evenly in voom normalization, this analysis will adopt the voom normalized data to perform further analysis.
# Reordering the matrix columns to have the normal cases first and then the tumor cases second: (this is for the heatmap plot)
rna_vm <- rna_vm[ , c(n_index, t_index)]
# just checking again, when printing both n_index and t_index, they should have a continues numbering.
# Number of Normal cell samples:
n_index <- which(substr(colnames(rna_vm),14,14) == '1')
# Number of tumor cell samples:
t_index <- which(substr(colnames(rna_vm),14,14) == '0')
sampleDists <- dist(t(rna_vm))
sampleDistMatrix <- as.matrix(sampleDists)
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
annotdf <- data.frame(row.names = colnames(rna_vm), category = c(rep("Normal", length(n_index)), rep("Tumor", length(t_index))))
rownames(sampleDistMatrix) <- colnames(rna_vm)
colnames(sampleDistMatrix) <- c(rep("Normal", length(n_index)), rep("Tumor", length(t_index)))
pheatmap(sampleDistMatrix,
col = colors,
gaps_row=length(n_index),
gaps_col=length(n_index),
annotation_row = annotdf,
cluster_rows = FALSE,
cluster_cols = FALSE,
show_rownames = FALSE,
show_colnames = FALSE,
annotation_names_row = FALSE)
In the above heatmap, we can see that the normal patients without cancer are more alike in terms of gene expression than within cancer patients.
autoplot(prcomp(t(rna_vm)), data = t(rna_vm), colour = c(rep("black", length(n_index)), rep("red", length(t_index))))
The above PCA plot of PC1 and PC2 datasets give a biologically relavant insight. The red dots denots cancer patients and the black dots denots normal patients. The cancer points are clusted around the origin, while the normal points are clusted on the fourth quadrant outside the cancer cluster. This tells us that normal gene expression produces almost the same variance while the variance of cancer gene expression are very random and the reason behind it could be that in Liven carcinoma cancer, there might be different sub categories of the cancer.
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
annotdf <- data.frame(row.names = colnames(rna_vm), category = c(rep("Normal", length(n_index)), rep("Tumor", length(t_index))))
rownames(sampleDistMatrix) <- colnames(rna_vm)
colnames(sampleDistMatrix) <- c(rep("Normal", length(n_index)), rep("Tumor", length(t_index)))
pheatmap(sampleDistMatrix,
col = colors,
gaps_row=length(n_index),
gaps_col=length(n_index),
annotation_row = annotdf,
cluster_rows = TRUE,
cluster_cols = FALSE,
show_rownames = FALSE,
show_colnames = FALSE,
annotation_names_row = FALSE)
Here the Z- score for each gene in each sample is calculated by comparing the normal and cancer samples from the cases. This allows us to consider individual gene significant values instead of cumulating pvalues for all genes combined to find the significant genes. By calculating the z score, we can find if a particular gene is significantly up or down regulated for each patient.
# Calculating Z Score
scal <- function(x,y){
mean_n <- rowMeans(y) # mean of normal
sd_n <- apply(y,1,sd) # SD of normal
res <- matrix(nrow=nrow(x), ncol=ncol(x))
colnames(res) <- colnames(x)
rownames(res) <- rownames(x)
for(i in 1:dim(x)[1]){
for(j in 1:dim(x)[2]){
res[i,j] <- (x[i,j]-mean_n[i])/sd_n[i]
}
}
return(res)
}
z_rna <- scal(rna_vm[,t_index],rna_vm[,n_index])
# Setting the rownames keeping only gene name
rownames(z_rna) <- sapply(rownames(z_rna), function(x) unlist(strsplit(x,'\\|'))[[1]])
# Dimensions of Z score for each genes in each tumor samples in terms of normal sample values
dim(z_rna)
## [1] 16910 373
z_rna[1:5,1:5]
## TCGA.2V.A95S.01A.11R.A37K.07 TCGA.2Y.A9GS.01A.12R.A38B.07
## ? 0.4781493 3.2921169
## ? 1.4040622 1.2730221
## ? 0.9726639 0.7364384
## ? 1.2534915 -0.0275773
## ? 2.7221215 2.7672378
## TCGA.2Y.A9GT.01A.11R.A38B.07 TCGA.2Y.A9GU.01A.11R.A38B.07
## ? -1.1816147 1.4699895
## ? 1.3177359 1.5580328
## ? 0.3042798 -0.9454642
## ? -1.1875702 1.2429450
## ? 1.7615871 3.5120054
## TCGA.2Y.A9GV.01A.11R.A38B.07
## ? 2.2556587
## ? 1.7531176
## ? 1.3555924
## ? 0.4345617
## ? 3.6989549
r <- as.numeric(apply(z_rna,1,function(i) sum(abs(i) > 1.96)))
figure(width = c(700)) %>%ly_hist(r,breaks = 500)
length(which(r==0))
## [1] 17
# Clinical Data upload
clinical <- t(read.table('Clinical/LIHC.merged_only_clinical_clin_format.txt',header=T, row.names=1, sep='\t'))
# Dimensions of Clinical data:
library(kableExtra)
dim(clinical)
## [1] 377 407
# Sample of first 5 rows and columns of clinical data:
temp <- as.data.frame(clinical[1:5,1:5])
kable(temp, "html") %>%
kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "left") %>%
scroll_box(width = "900px", height = "300px")
| admin.month_of_dcc_upload | admin.patient_withdrawal.withdrawn | admin.project_code | admin.year_of_dcc_upload | patient.ablations | |
|---|---|---|---|---|---|
| V2 | 1 | false | tcga | 2016 | NA |
| V3 | 1 | false | tcga | 2016 | NA |
| V4 | 1 | false | tcga | 2016 | NA |
| V5 | 1 | false | tcga | 2016 | NA |
| V6 | 1 | false | tcga | 2016 | NA |
# Number of Samples in clinical data:
length(colnames(clinical))
## [1] 407
# Number of Samples in z_rna data:
length(colnames(z_rna))
## [1] 373
# Number of clinical cases we can use from the available:
clinical <- as.data.frame(clinical)
clinical$IDs <- toupper(clinical$patient.bcr_patient_barcode)
sum(clinical$IDs %in% colnames(z_rna))
## [1] 0
Since there are multiple fields in clinical data set and we need only specific fields for analysis, matrix will be reduced by combining those fields. In the clinical data set, we need three main fields, tumor event, death event for survival analysis and follow up event for censoring. For new tumor event, There are 8 fields which will all be combined by taking the highest value of all the fields.
# Number of columns with new tumor event data:
ind_keep <- grep('days_to_new_tumor_event_after_initial_treatment',colnames(clinical))
length(ind_keep)
## [1] 8
# Column names of all new tumor event fields:
kable(colnames(clinical)[ind_keep], "html") %>%
kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "left")
| x |
|---|
| patient.follow_ups.follow_up-2.new_tumor_events.new_tumor_event-2.days_to_new_tumor_event_after_initial_treatment |
| patient.follow_ups.follow_up-2.new_tumor_events.new_tumor_event.days_to_new_tumor_event_after_initial_treatment |
| patient.follow_ups.follow_up-3.new_tumor_events.new_tumor_event.days_to_new_tumor_event_after_initial_treatment |
| patient.follow_ups.follow_up-4.new_tumor_events.new_tumor_event.days_to_new_tumor_event_after_initial_treatment |
| patient.follow_ups.follow_up.new_tumor_events.new_tumor_event-2.days_to_new_tumor_event_after_initial_treatment |
| patient.follow_ups.follow_up.new_tumor_events.new_tumor_event-3.days_to_new_tumor_event_after_initial_treatment |
| patient.follow_ups.follow_up.new_tumor_events.new_tumor_event-4.days_to_new_tumor_event_after_initial_treatment |
| patient.follow_ups.follow_up.new_tumor_events.new_tumor_event.days_to_new_tumor_event_after_initial_treatment |
The same will be performed for the 5 fields that contain death information.
# Number of columns with days to death field:
inddeath_keep <- grep('days_to_death',colnames(clinical))
length(inddeath_keep)
## [1] 5
# Column names of all days to death fields:
kable(colnames(clinical)[inddeath_keep], "html") %>%
kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "left")
| x |
|---|
| patient.days_to_death |
| patient.follow_ups.follow_up-2.days_to_death |
| patient.follow_ups.follow_up-3.days_to_death |
| patient.follow_ups.follow_up-4.days_to_death |
| patient.follow_ups.follow_up.days_to_death |
Similarly, for the follow up fields too.
# Number of columns with last day of follow up field:
indfollow_keep <- grep('days_to_last_followup',colnames(clinical))
length(indfollow_keep)
## [1] 5
# Column names of all last day to follow up fields:
kable(colnames(clinical)[indfollow_keep], "html") %>%
kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "left")
| x |
|---|
| patient.days_to_last_followup |
| patient.follow_ups.follow_up-2.days_to_last_followup |
| patient.follow_ups.follow_up-3.days_to_last_followup |
| patient.follow_ups.follow_up-4.days_to_last_followup |
| patient.follow_ups.follow_up.days_to_last_followup |
Now that we know what fields are we can combine for the final analysis, a new matrix is generated by the following code:
# For New tumore event:
new_tum <- as.matrix(clinical[,ind_keep])
new_tum_collapsed <- c()
for (i in 1:dim(new_tum)[1]){
if ( sum ( is.na(new_tum[i,])) < dim(new_tum)[2]){
m <- min(new_tum[i,],na.rm=T)
new_tum_collapsed <- c(new_tum_collapsed,m)
} else {
new_tum_collapsed <- c(new_tum_collapsed,'NA')
}
}
# For death events:
ind_keep <- grep('days_to_death',colnames(clinical))
death <- as.matrix(clinical[,ind_keep])
death_collapsed <- c()
for (i in 1:dim(death)[1]){
if ( sum ( is.na(death[i,])) < dim(death)[2]){
m <- max(death[i,],na.rm=T)
death_collapsed <- c(death_collapsed,m)
} else {
death_collapsed <- c(death_collapsed,'NA')
}
}
# For follow up events:
ind_keep <- grep('days_to_last_followup',colnames(clinical))
fl <- as.matrix(clinical[,ind_keep])
fl_collapsed <- c()
for (i in 1:dim(fl)[1]){
if ( sum (is.na(fl[i,])) < dim(fl)[2]){
m <- max(fl[i,],na.rm=T)
fl_collapsed <- c(fl_collapsed,m)
} else {
fl_collapsed <- c(fl_collapsed,'NA')
}
}
# Combining the fields together to create an analysis matrix:
all_clin <- data.frame(new_tum_collapsed,death_collapsed,fl_collapsed)
colnames(all_clin) <- c('new_tumor_days', 'death_days', 'followUp_days')
#Dimensions of all_clin matrix:
dim(all_clin)
## [1] 377 3
# Sample of clinical data
kable(head(all_clin), "html") %>%
kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "left")
| new_tumor_days | death_days | followUp_days |
|---|---|---|
| NA | NA | NA |
| NA | 724 | NA |
| NA | 1624 | NA |
| NA | NA | 1939 |
| NA | 2532 | NA |
| NA | 1271 | NA |
Since some patients sometimes discontinue their treatment or for some reason, the death or tumor event is not recorded, these data are removed from the analysis since we would like to keep only the cases where a certain death event or survival event is known. Therefore, for each cases, if the follow up event is the longest date available compared to the death and tumor event, then these rows will not be considered.
# vector with time to new tumor containing data to censor for new_tumor
all_clin$new_time <- c()
for (i in 1:length(as.numeric(as.character(all_clin$new_tumor_days)))){
all_clin$new_time[i] <- ifelse ( is.na(as.numeric(as.character(all_clin$new_tumor_days))[i]),
as.numeric(as.character(all_clin$followUp_days))[i],as.numeric(as.character(all_clin$new_tumor_days))[i])
}
# vector with time to death containing values to censor for death
all_clin$new_death <- c()
for (i in 1:length(as.numeric(as.character(all_clin$death_days)))){
all_clin$new_death[i] <- ifelse ( is.na(as.numeric(as.character(all_clin$death_days))[i]),
as.numeric(as.character(all_clin$followUp_days))[i],as.numeric(as.character(all_clin$death_days))[i])
}
dim(all_clin)
## [1] 377 5
all_clin$death_event <- ifelse(clinical$patient.follow_ups.follow_up.vital_status == 'alive', 0,1)
# Adding row.names to clinical
rownames(all_clin) <- clinical$IDs